Using HMM to learn user browsing patterns for focused Web crawling
Authors
Abstract
A focused crawler traverses the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. To estimate the relevance of a newly seen URL, it must use information gleaned from previously crawled page sequences. In this paper, we present a new approach for predicting the links that lead to relevant pages, based on a Hidden Markov Model (HMM). The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. In particular, we first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is then used to learn page sequences that are likely to lead to target pages. The learning is performed using an HMM. During crawling, the priority of each link to follow is based on a learned estimate of how likely the page is to lead to a target page. In our experiments, this approach performs better than both Context-Graph crawling and Best-First crawling. © 2006 Elsevier B.V. All rights reserved.
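The crawling stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the model details, so the sketch assumes that hidden states encode distance to a target page (state 0 is "target"), that observations are the cluster IDs of pages along the path to a link, and that a link's priority is the predicted probability that the next page is in the target state. The function names and the probability tables `START`, `TRANS`, and `EMIT` are hypothetical values chosen for the example; in the described system they would be learned from the user's browsing data.

```python
import heapq

# Illustrative (made-up) HMM parameters; assumed to be learned elsewhere.
# Hidden states: 0 = target page, 1 = one hop away, 2 = two or more hops away.
START = [0.2, 0.3, 0.5]
TRANS = [[0.6, 0.3, 0.1],   # TRANS[i][j] = P(next state j | current state i)
         [0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3]]
EMIT = [[0.7, 0.2, 0.1],    # EMIT[s][c] = P(page falls in cluster c | state s)
        [0.2, 0.6, 0.2],
        [0.1, 0.2, 0.7]]


def forward_filter(obs, start, trans, emit):
    """Forward algorithm: P(current state | cluster IDs observed so far)."""
    n = len(start)
    belief = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = sum(belief)
    belief = [b / total for b in belief]
    for o in obs[1:]:
        # Propagate one step through the transition model, then
        # weight by how well each state explains the observed cluster.
        predicted = [sum(belief[i] * trans[i][j] for i in range(n))
                     for j in range(n)]
        belief = [predicted[j] * emit[j][o] for j in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


def link_priority(obs, start, trans, emit, target_state=0):
    """Probability that the next page reached is in the target state."""
    belief = forward_filter(obs, start, trans, emit)
    return sum(belief[i] * trans[i][target_state] for i in range(len(belief)))


def crawl_order(candidates, start, trans, emit):
    """Order candidate URLs from highest to lowest priority with a heap."""
    heap = []
    for url, obs in candidates.items():
        # heapq is a min-heap, so push the negated priority.
        heapq.heappush(heap, (-link_priority(obs, start, trans, emit), url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Each candidate link carries the cluster IDs seen on the path leading to it.
candidates = {
    "http://example.com/a": [2, 2],  # path stays in the "far" cluster
    "http://example.com/b": [2, 1],  # path moves toward the "near" cluster
}
order = crawl_order(candidates, START, TRANS, EMIT)
```

With these illustrative tables, the link whose path moves toward the "near" cluster is ranked ahead of the one that stays in the "far" cluster. A Best-First crawler, by contrast, would typically score a link from the content of its parent page alone, without using the sequence of pages that led there.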
Similar resources
Future View: Web Navigation Based on Learning User's Browsing Patterns
In this paper, we propose a Future View system that assists user’s usual Web browsing. The Future View will prefetch Web pages based on user’s browsing strategies and present them to a user in order to assist Web browsing. To learn user’s browsing patterns, the Future View uses two types of learning classifier systems: a content-based classifier system for contents change patterns and an action...
Future View: Web Navigation based on Learning User’s Browsing Patterns by Classifier Systems
In this paper, we propose a Future View system that assists user’s usual Web browsing. A Future View will prefetch Web pages based on user’s browsing strategies and present them to a user in order to assist Web browsing. To learn browsing patterns for a user, Future View uses two types of learning classifier systems: a content-based classifier system for contents change patterns and an action-b...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Extracting user web browsing patterns from non-content network traces: The online advertising case study
Online advertising is a rapidly growing industry currently dominated by the search engine ’giant’ Google. In an attempt to tap into this huge market, Internet Service Providers (ISPs) started deploying deep packet inspection techniques to track and collect user browsing behavior. However, these providers have the fear that such techniques violate wiretap laws that explicitly prevent interceptin...
On Learning Strategies for Topic Specific Web Crawling
Crawling has been a topic of considerable interest in recent years because of the rapid growth of the world wide web. In many cases, it is possible to design more effective crawlers which can find web pages belonging to specific topics. In this paper, we will discuss some recent techniques for crawling web pages belonging to specific topics. We discuss the following classes of techniques: (1) I...
Journal title:
- Data Knowl. Eng.
Volume 59, Issue
Pages -
Publication date 2006